# Transcribing lectures using Whisper
Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing. Whisper can often end up creating very short text chunks, leading to a kind of “rapid fire” effect.

Assuming a good quality recording, the following settings seem to do a good job:

* Medium model (specifically `medium.en`). This performs better than the large model when transcribing English, because `large` doesn’t have an English-specific variant.
* Enable word timestamps.
* Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward text breaks.
* VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.)
* Disable FP16 if running on an Apple M1 or M2, as they don’t support it. Should be fine on everything else.
* An initial prompt may improve accuracy, e.g., “This is a postgraduate lecture about ethical issues in big data. The main topics are ethics, law, privacy, data dredging, and statistics.”
* Normalising the audio beforehand may or may not help. The `speechnorm` filter in FFmpeg seems quite effective, e.g., `ffmpeg -i <input> -filter:a speechnorm=e=12.5 <output>`.

For example:

```sh
whisper --model medium.en --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file>
```

Offline transcription is roughly real-time (i.e., 1 hour of audio takes about 1 hour to transcribe). Models are automatically downloaded to `~/.cache/whisper`.

`whisper-cpp` is actually the one we want, as it’s written in C++ and supports Core ML. Annoyingly, the CLI options are different, although it seems to have more of them. It uses the same models as Vibe (see below). It only supports 16 kHz WAV as input 🙁 (`ffmpeg -i <input> -vn -ar 16000 <output>` works for any input).

Useful options:

* `--offset-t` and maybe `--duration` to specify the start and end. Helpful to synchronise timestamps? `--offset-t` also helps if Whisper gets confused by a lack of audio/speech at the start of a recording.
* `--print-colors` to show confidence levels? Hmm, really designed for a dark theme…
* `--tinydiarize` to identify speaker changes? (Requires a `tdrz` model.) Unclear how this relates to `--diarize`.

```sh
whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 --prompt "<prompt>" <input file>
```

**Much** faster: about 8×, e.g., 1 hour 45 minutes takes about 12 minutes.

**Vibe** seems to be a useful cross-platform GUI implementation; internally it is whisper-cpp ported to Rust. It claims to have a CLI, but I can’t figure out how to make it work. It produces malformed VTT: no WEBVTT header, and no blank lines between entries.
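A minimal repair sketch for Vibe’s VTT output, assuming the only problems are the missing `WEBVTT` header and the missing blank lines between cues, and that every cue’s timestamp line contains `-->` (file names are placeholders):

```sh
# Prepend the WEBVTT header, then insert a blank line before every
# timestamp line except the first; cue text and timestamps are untouched.
{ printf 'WEBVTT\n\n'; awk '/-->/ && NR > 1 { print "" } { print }' broken.vtt; } > fixed.vtt
```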
VS Code extension issues:

* Missing feature: merge subtitles, i.e., merge the selected subtitles into one and adjust the timestamps accordingly.
* Bug: sometimes adjusting timing leads to timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic.

## Standard workflow

1. Download the (low quality) primary audio/video recorded from Echo360 (**primary**).
1. Download the alternative audio from the Zoom H2 recorder (**alt-original**).
1. Normalise the **alt-original** audio and upload it to Echo360 (**alt-normalised**):

    ```sh
    # normalise alt-original to alt-normalised (single part)
    ffmpeg -i input.wav -filter:a 'speechnorm=e=12.5' DATE-normalised.wav
    # normalise alt-original to alt-normalised (multiple parts)
    ffmpeg -i input1.wav -i input2.wav -filter_complex 'concat=n=2:v=0:a=1,speechnorm=e=12.5' DATE-normalised.wav
    ```

1. Extract 16 kHz audio from **primary** and **alt-normalised**, as `whisper-cpp` requires 16 kHz WAV:

    ```sh
    # 16 kHz alt-normalised
    ffmpeg -i DATE-normalised.wav -vn -ar 16000 DATE-normalised-16khz.wav
    # 16 kHz primary
    ffmpeg -i INFO\ 408\ S2\ 2024\ Lec-s1-low.mp4 -vn -ar 16000 INFO\ 408\ S2\ 2024\ Lec-s1-low.wav
    ```

1. Generate VTTs from the 16 kHz **primary** and **alt-normalised**:

    ```sh
    # from primary
    whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 INFO\ 408\ S2\ 2024\ Lec-s1-low.wav
    # from alt-normalised, used for corrections and context
    whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 DATE-normalised-16khz.wav
    ```

1. Clean up the **primary** VTT.
1. Download the Echo360-encoded version of **alt-normalised** (**alt-echo**).
1. Copy the **primary** VTT to the **alt-echo** VTT and adjust the timings (see the sketch after this list).
1. Upload the **primary** and **alt-echo** VTTs to Echo360.
1. Delete everything except: the **alt-original** audio, the **primary** VTT, the **alt-echo** VTT, and the Zoom meeting chat (if any).
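For the “adjust timings” step above, a constant shift of every cue is often all that’s needed. A minimal sketch (the `offset` value in seconds and the file names are placeholders; it assumes `HH:MM:SS.mmm` timestamps and one `-->` line per cue):

```sh
# Shift every cue in primary.vtt by a fixed offset (seconds, may be negative)
# to line it up with the alt-echo recording, writing the result to alt-echo.vtt.
awk -v offset=2.5 '
function to_secs(t,    a) {
    split(t, a, /[:.]/)
    return a[1] * 3600 + a[2] * 60 + a[3] + a[4] / 1000
}
function to_stamp(s,    h, m) {
    h = int(s / 3600); s -= h * 3600
    m = int(s / 60);   s -= m * 60
    return sprintf("%02d:%02d:%06.3f", h, m, s)
}
/-->/ { print to_stamp(to_secs($1) + offset), "-->", to_stamp(to_secs($3) + offset); next }
{ print }
' primary.vtt > alt-echo.vtt
```

A negative offset that would push a cue before 00:00:00.000 would need extra handling.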